60 research outputs found

    Improved Microarray-Based Decision Support with Graph Encoded Interactome Data

    In the past, microarray studies have been criticized for noise and the limited overlap between gene signatures. Prior biological knowledge should therefore be incorporated as side information in models based on gene expression data to improve the accuracy of diagnosis and prognosis in cancer. As prior knowledge, we investigated interaction and pathway information from the human interactome on different aspects of biological systems. By exploiting the properties of kernel methods, relations between genes with similar functions but active in alternative pathways could be incorporated in a support vector machine classifier based on spectral graph theory. Using 10 microarray data sets, we first reduced the number of data sources relevant for multiple cancer types and outcomes. Three sources on metabolic pathway information (KEGG), protein-protein interactions (OPHID) and miRNA-gene targeting (microRNA.org) outperformed the other sources with regard to the considered class of models. Both fixed and adaptive approaches were subsequently considered to combine the three corresponding classifiers. Averaging the predictions of these classifiers performed best and was significantly better than the model based on microarray data only. These results were confirmed on 6 validation microarray sets, with a significantly improved performance in 4 of them. Integrating interactome data thus improves classification of cancer outcome for the investigated microarray technologies and cancer types. Moreover, this strategy can be incorporated in any kernel method or non-linear version of a non-kernel method.

    A kernel-based integration of genome-wide data for clinical decision support

    Background: Although microarray technology allows the investigation of the transcriptomic make-up of a tumor in one experiment, the transcriptome does not completely reflect the underlying biology due to alternative splicing, post-translational modifications, as well as the influence of pathological conditions (for example, cancer) on transcription and translation. This increases the importance of fusing more than one source of genome-wide data, such as the genome, transcriptome, proteome, and epigenome. The current increase in the amount of available omics data emphasizes the need for a methodological integration framework. Methods: We propose a kernel-based approach for clinical decision support in which many genome-wide data sources are combined. Integration occurs within the patient domain at the level of kernel matrices before building the classifier. As the supervised classification algorithm, a weighted least squares support vector machine is used. We apply this framework to two cancer cases, namely, a rectal cancer data set containing microarray and proteomics data and a prostate cancer data set containing microarray and genomics data. For both cases, multiple outcomes are predicted. Results: For the rectal cancer outcomes, the highest leave-one-out (LOO) areas under the receiver operating characteristic curves (AUC) were obtained when combining microarray and proteomics data gathered during therapy and ranged from 0.927 to 0.987. For prostate cancer, all four outcomes had a better LOO AUC when combining microarray and genomics data, ranging from 0.786 for recurrence to 0.987 for metastasis. Conclusions: For both cancer sites the prediction of all outcomes improved when more than one genome-wide data set was considered. This suggests that integrating multiple genome-wide data sources increases the predictive performance of clinical decision support models. This emphasizes the need for comprehensive multi-modal data. We acknowledge that, in a first phase, this will substantially increase costs; however, this is a necessary investment to ultimately obtain cost-efficient models usable in patient-tailored therapy.
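
The integration step described in the Methods (combining data sources at the level of patient-by-patient kernel matrices before the classifier is built) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear kernel, the normalization to unit self-similarity, the equal weights, and the toy values are all assumptions, and the weighted least squares support vector machine itself is left out.

```python
import math


def linear_kernel(X):
    """Gram matrix K[i][j] = <x_i, x_j> over the rows (patients) of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def normalize(K):
    """Rescale K so each patient has self-similarity 1 (assumed preprocessing)."""
    n = len(K)
    return [
        [K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
        for i in range(n)
    ]


def combine(kernels, weights=None):
    """Weighted sum of kernel matrices that share the same patient order."""
    if weights is None:
        # Assumption: equal weight per data source.
        weights = [1.0 / len(kernels)] * len(kernels)
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: 3 patients measured in two sources with
# different numbers of features.
expression = [[1.0, 0.0, 2.0], [0.5, 1.0, 1.5], [2.0, 2.0, 0.0]]
proteomics = [[0.2, 0.9], [0.3, 0.8], [0.9, 0.1]]

K = combine([
    normalize(linear_kernel(expression)),
    normalize(linear_kernel(proteomics)),
])
```

Because every source is reduced to a matrix of the same size (patients by patients), sources with very different feature counts can be added together; the combined matrix `K` would then be handed to any kernel classifier.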

    Improved modeling of clinical data with kernel methods

    Objective: Despite the rise of high-throughput technologies, clinical data such as age, gender and medical history guide clinical management for most diseases and examinations. To improve clinical management, available patient information should be fully exploited. This requires appropriate modeling of relevant parameters. Methods: When kernel methods are used, traditional kernel functions such as the linear kernel are often applied to the set of clinical parameters. These kernel functions, however, have disadvantages due to the specific characteristics of clinical data, which are a mix of variable types, each with its own range. We propose a new kernel function specifically adapted to the characteristics of clinical data. Results: The clinical kernel function provides a better representation of patients' similarity by equalizing the influence of all variables and taking into account the range r of the variables. Moreover, it is robust with respect to changes in r. Incorporated in a least squares support vector machine, the new kernel function results in significantly improved diagnosis, prognosis and prediction of therapy response. This is illustrated on four clinical data sets within gynecology, with an average increase in test area under the ROC curve (AUC) of 0.023, 0.021, 0.122 and 0.019, respectively. Moreover, when combining clinical parameters and expression data in three case studies on breast cancer, results improved overall with use of the new kernel function and when considering both data types in a weighted fashion, with a larger weight assigned to the clinical parameters. The maximum increase in AUC with respect to a standard kernel function and/or unweighted data combination was 0.127, 0.042 and 0.118 for the three case studies. Conclusion: For clinical data consisting of variables of different types, the proposed kernel function, which takes into account the type and range of each variable, has been shown to be a better alternative for linear and non-linear classification problems. (C) 2011 Elsevier B.V. All rights reserved.
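
A kernel of the kind described above (each variable contributes equally, and the range r of a variable scales its contribution) might look like the sketch below. The abstract does not state the formula, so the per-variable forms used here are assumptions: (r - |a - b|) / r for continuous and ordinal variables, an equality check for nominal ones, and a plain average over variables. The variable names and values are hypothetical.

```python
def clinical_kernel(x, z, kinds, ranges):
    """Average per-variable similarity between two patients.

    kinds[i] is "continuous", "ordinal" or "nominal"; ranges[i] is the
    range r of variable i (ignored for nominal variables).
    The per-variable formulas are assumed, not taken from the abstract.
    """
    total = 0.0
    for a, b, kind, r in zip(x, z, kinds, ranges):
        if kind == "nominal":
            # Nominal variables: similar only when the categories match.
            total += 1.0 if a == b else 0.0
        else:
            # Continuous/ordinal variables: distance relative to the range,
            # so every variable yields a value between 0 and 1.
            total += (r - abs(a - b)) / r
    return total / len(x)


# Hypothetical patients: age in years, an ordinal score, a nominal status.
kinds = ["continuous", "ordinal", "nominal"]
ranges = [60, 4, None]
patient_a = (50, 1, "post")
patient_b = (35, 3, "post")

similarity = clinical_kernel(patient_a, patient_b, kinds, ranges)
```

Dividing by the range keeps a variable measured on a wide scale (age) from dominating one measured on a narrow scale (a four-level score), which is the "equalizing" behaviour the abstract refers to.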

    L2-norm multiple kernel learning and its application to biomedical data fusion

    Background: This paper introduces the notion of optimizing different norms in the dual problem of support vector machines with multiple kernels. The selection of norms yields different extensions of multiple kernel learning (MKL) such as L∞, L1, and L2 MKL. In particular, L2 MKL is a novel method that leads to non-sparse optimal kernel coefficients, which is different from the sparse kernel coefficients optimized by the existing L∞ MKL method. In real biomedical applications, L2 MKL may have advantages over sparse integration methods for thoroughly combining complementary information in heterogeneous data sources. Results: We provide a theoretical analysis of the relationship between the L2 optimization of kernels in the dual problem and the L2 coefficient regularization in the primal problem. Understanding the dual L2 problem grants a unified view on MKL and enables us to extend the L2 method to a wide range of machine learning problems. We implement L2 MKL for ranking and classification problems and compare its performance with the sparse L∞ and the averaging L1 MKL methods. The experiments are carried out on six real biomedical data sets and two large-scale UCI data sets. L2 MKL yields better performance on most of the benchmark data sets. In particular, we propose a novel L2 MKL least squares support vector machine (LSSVM) algorithm, which is shown to be an efficient and promising classifier for processing large-scale data sets. Conclusions: This paper extends the statistical framework of genomic data fusion based on MKL. Allowing non-sparse weights on the data sources is an attractive option in settings where we believe most data sources to be relevant to the problem at hand and want to avoid a "winner-takes-all" effect seen in L∞ MKL, which can be detrimental to the performance in prospective studies. The notion of optimizing L2 kernels can be straightforwardly extended to ranking, classification, regression, and clustering algorithms. To tackle the computational burden of MKL, this paper proposes several novel LSSVM-based MKL algorithms. Systematic comparison on real data sets shows that LSSVM MKL has performance comparable to the conventional SVM MKL algorithms. Moreover, large-scale numerical experiments indicate that when cast as semi-infinite programming, LSSVM MKL can be solved more efficiently than SVM MKL. Availability: The MATLAB code of algorithms implemented in this paper is downloadable from http://homes.esat.kuleuven.be/~sistawww/bioi/syu/l2lssvm.html.

    Modeling precision treatment of breast cancer

    Background: First-generation molecular profiles for human breast cancers have enabled the identification of features that can predict therapeutic response; however, little is known about how the various data types can best be combined to yield optimal predictors. Collections of breast cancer cell lines mirror many aspects of breast cancer molecular pathobiology, and measurements of their omic and biological therapeutic responses are well-suited for development of strategies to identify the most predictive molecular feature sets. Results: We used least squares-support vector machines and random forest algorithms to identify molecular features associated with responses of a collection of 70 breast cancer cell lines to 90 experimental or approved therapeutic agents. The datasets analyzed included measurements of copy number aberrations, mutations, gene and isoform expression, promoter methylation and protein expression. Transcriptional subtype contributed strongly to response predictors for 25% of compounds, and adding other molecular data types improved prediction for 65%. No single molecular dataset consistently out-performed the others, suggesting that therapeutic response is mediated at multiple levels in the genome. Response predictors were developed and applied to TCGA data, and were found to be present in subsets of those patient samples. Conclusions: These results suggest that matching patients to treatments based on transcriptional subtype will improve response rates, and inclusion of additional features from other profiling data types may provide additional benefit. Further, we suggest a systems biology strategy for guiding clinical trials so that patient cohorts most likely to respond to new therapies may be more efficiently identified

    A taxonomy of epithelial human cancer and their metastases

    Background: Microarray technology has made it possible to molecularly characterize many different cancer sites. This technology has the potential to individualize therapy and to discover new drug targets. However, due to technological differences and issues in standardized sample collection, no study has evaluated the molecular profile of epithelial human cancer in a large number of samples and tissues. Additionally, it has not yet been extensively investigated whether metastases resemble their tissue of origin or tissue of destination. Methods: We studied the expression profiles of a series of 1566 primary tumors and 178 metastases by unsupervised hierarchical clustering. The clustering profile was subsequently investigated and correlated with clinico-pathological data. Statistical enrichment of clinico-pathological annotations of groups of samples was investigated using the Fisher exact test. Gene set enrichment analysis (GSEA) and DAVID functional enrichment analysis were used to investigate the molecular pathways. Kaplan-Meier survival analysis and log-rank tests were used to investigate the prognostic significance of gene signatures. Results: Large clusters corresponding to breast, gastrointestinal, ovarian and kidney primary tissues emerged from the data. Chromophobe renal cell carcinoma clustered together with follicular differentiated thyroid carcinoma, which supports recent morphological descriptions of thyroid follicular carcinoma-like tumors in the kidney and suggests that they represent a subtype of chromophobe carcinoma. We also found an expression signature identifying primary tumors of squamous cell histology in multiple tissues. Next, a subset of ovarian tumors enriched with endometrioid histology clustered together with endometrium tumors, confirming that they share their etiopathogenesis, which strongly differs from serous ovarian tumors. In addition, the clustering of colon and breast tumors correlated with clinico-pathological characteristics. Moreover, a signature was developed based on our unsupervised clustering of breast tumors and this was predictive for disease-specific survival in three independent studies. Next, the metastases from ovary, breast, lung and vulva clustered with their tissue of origin, while metastases from colon showed a bimodal distribution: a significant part clustered with the tissue of origin while the remaining tumors clustered with the tissue of destination. Conclusion: Our molecular taxonomy of epithelial human cancer indicates surprising correlations over tissues. This may have a significant impact on the classification of many cancer sites and may guide pathologists, both in research and daily practice. Moreover, these results based on unsupervised analysis yielded a signature predictive of clinical outcome in breast cancer. Additionally, we hypothesize that metastases of gastrointestinal origin either remember their tissue of origin or adapt to the tissue of destination. More specifically, colon metastases in the liver show strong evidence for such a bimodal tissue-specific profile.
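
The enrichment analysis mentioned in the Methods (testing whether a clinico-pathological annotation is over-represented in a cluster of samples) can be illustrated with a one-sided Fisher exact test. The sketch below is not the authors' code: the abstract does not say whether a one-sided or two-sided test was used, so the one-sided form is an assumption, and the function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(k, cluster_size, annotated_total, n_samples):
    """One-sided Fisher exact test p-value.

    Probability of seeing at least k annotated samples in a cluster of
    cluster_size samples, when annotated_total of the n_samples carry the
    annotation and cluster membership is unrelated to it
    (hypergeometric upper tail).
    """
    upper = min(cluster_size, annotated_total)
    favourable = sum(
        comb(annotated_total, i)
        * comb(n_samples - annotated_total, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_samples, cluster_size)


# Hypothetical counts: 100 samples, 20 with a given histology,
# and a cluster of 10 samples of which 8 have that histology.
p = enrichment_p_value(k=8, cluster_size=10, annotated_total=20, n_samples=100)
```

A small `p` indicates that the annotation is concentrated in the cluster more than chance would explain; in a study that tests many annotations and clusters, such p-values would normally also be corrected for multiple testing.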

    HER2 is not a cancer subtype but rather a pan-cancer event and is highly enriched in AR-driven breast tumors

    Background: Approximately one in five breast cancers are driven by amplification and overexpression of the human epidermal growth factor receptor 2 (HER2) receptor kinase, and HER2-enriched (HER2E) is one of four major transcriptional subtypes of breast cancer. We set out to understand the genomics of HER2 amplification independent of subtype, and the underlying drivers and biology of HER2E tumors. Methods: We investigated published genomic data from 3155 breast tumors and 5391 non-breast tumors. Results: HER2 amplification is a distinct driver event seen in all breast cancer subtypes, rather than a subtype marker, with major characteristics restricted to amplification and overexpression of HER2 and neighboring genes. The HER2E subtype has a distinctive transcriptional landscape independent of HER2A that reflects androgen receptor signaling as replacement for estrogen receptor (ER)-driven tumorigenesis. HER2 amplification is also an event in 1.8% of non-breast tumors. Conclusions: These discoveries reveal therapeutic opportunities for combining anti-HER2 therapy with anti-androgen agents in breast cancer, and highlight the potential for broader therapeutic use of HER2 inhibitors.

    Additional file 13: of HER2 is not a cancer subtype but rather a pan-cancer event and is highly enriched in AR-driven breast tumors

    Coordinated expression of HER2-neighboring genes in the absence of amplification. (A-H) Expression of genes in the core HER2 amplicon (PGAP3, ERBB2, MIEN1, GRB7), representative genes in the broad HER2 amplicon (MED1, CDK12, NR1D1), and TOP2A (more telomeric on 17q), in HER2A tumors (red), non-HER2A tumors without HER2 overexpression (black), and non-HER2A tumors with HER2 overexpression (o/e, log2 nRPKM + 1 ≥ 8.2; green). (A) Seven non-HER2A, o/e breast tumors. (B) Six non-HER2A, o/e gastric tumors. (C) Two non-HER2A, o/e endometrial tumors. (D) Two non-HER2A, o/e cervix tumors. (E) Two non-HER2A, o/e bladder tumors. (F) Two non-HER2A, o/e lung squamous cell carcinoma tumors. (G) One non-HER2A, o/e lung adenocarcinoma tumor. (H) One non-HER2A, o/e ovarian tumor. (I) Relative copy number levels for broad HER2 amplicon genes in 23 non-HER2A o/e tumors. Relative copy number levels exceed 2 (or are borderline at 1.9) in 6 out of 23 tumors. Tumors are colored by cancer, and genes in the broad HER2 amplicon are colored as per Fig. 2a. (J-M) Average log2 ratio of methylated to unmethylated intensity of CpG probes near HER2 and its closest neighbors PGAP3, MIEN1 and GRB7 (in the gene body or maximum 2 kb upstream of the transcription start site) with Kruskal-Wallis test FDR p value below the indicated value for a four-group comparison: HER2A, non-o/e; HER2A, o/e; non-HER2A, non-o/e; non-HER2A, o/e. (J) For breast cancer, included are 27 methylation probes with Kruskal-Wallis FDR p < 1e-15. (K) For gastric cancer, included are 37 methylation probes with p < 1e-5. (L) For cervix cancer, included are 28 methylation probes with p < 0.01. (M) For bladder cancer, included are 21 methylation probes with p < 0.01. (PDF 571 kb)

    Maximum Likelihood Estimation of GEVD: Applications in Bioinformatics
